Nemo task5 6 hooks and progress reporting #9

Merged
TianyeGGBond merged 4 commits into rlops:nemo from yayajjiang:nemo-task5-6-hooks-and-progress-reporting on May 5, 2026
Conversation

@yayajjiang
Collaborator

No description provided.

yayajjiang and others added 3 commits April 19, 2026 09:58
… reporting

Task 5 (F11-flag, F5-hooks):
- Add RLixHooks Protocol with 7-method interface (before/after_generation,
  before/after_training, before_weight_sync, begin/end_progress_batch)
- Add NoOpRLixHooks default implementation for standalone NeMo RL runs
- Add NemoRLRLixHooks with GPU hook placeholders (Task 7 TODOs) and
  Task 3 NCCL placeholder in before_weight_sync
- Add grpo.py stub with DO_TIME_SHARING flag and 5 hook insertion points
  in the correct per-step order
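For orientation, a minimal sketch of the hook surface described above. The Protocol name, the default class, and the seven method names are from this commit; the signatures (no-arg methods returning None, a sample count on begin_progress_batch) are my assumptions:

```python
from typing import Protocol


class RLixHooks(Protocol):
    """The 7-method interface named in the commit; signatures assumed."""

    def before_generation(self) -> None: ...
    def after_generation(self) -> None: ...
    def before_training(self) -> None: ...
    def after_training(self) -> None: ...
    def before_weight_sync(self) -> None: ...
    def begin_progress_batch(self, total_samples: int) -> None: ...
    def end_progress_batch(self) -> None: ...


class NoOpRLixHooks:
    """Default implementation for standalone NeMo RL runs: every hook
    is a no-op, so the training loop runs unchanged."""

    def before_generation(self) -> None: pass
    def after_generation(self) -> None: pass
    def before_training(self) -> None: pass
    def after_training(self) -> None: pass
    def before_weight_sync(self) -> None: pass
    def begin_progress_batch(self, total_samples: int) -> None: pass
    def end_progress_batch(self) -> None: pass
```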

Task 6 (F9):
- Implement begin_progress_batch / end_progress_batch state machine
- 2% bucket granularity (50 buckets), deduplicated fire-and-forget emit
- _emit_progress() isolated as overridable method for testability
- _emit_progress body is TODO pending Task 7 scheduler actor wiring

Tests: 30 unit tests, all passing, no GPU/Ray/vLLM required
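A sketch of the Task 6 state machine above, assuming a per-sample completion callback; everything except begin/end_progress_batch, _emit_progress, and the 50-bucket / 2% numbers is invented for illustration:

```python
class ProgressState:
    NUM_BUCKETS = 50  # 2% granularity

    def begin_progress_batch(self, total_samples: int) -> None:
        self._total = max(total_samples, 1)
        self._last_bucket = -1  # nothing emitted yet this batch

    def on_sample_done(self, completed: int) -> None:
        # Hypothetical per-sample entry point: map the completion count
        # to a bucket, emit only when the bucket advances (deduplication).
        bucket = completed * self.NUM_BUCKETS // self._total
        if bucket > self._last_bucket:
            self._last_bucket = bucket
            self._emit_progress(bucket / self.NUM_BUCKETS)

    def end_progress_batch(self) -> None:
        if self._last_bucket < self.NUM_BUCKETS:
            self._emit_progress(1.0)  # guarantee a terminal 100% emit
        self._last_bucket = -1

    def _emit_progress(self, fraction: float) -> None:
        # Isolated so tests can override it; per the commit, the real
        # body is TODO until Task 7 wires in the scheduler actor, where
        # it becomes a fire-and-forget send.
        pass
```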

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tep_target_estimate bootstrap

- grpo.py: begin_progress_batch now called before before_generation each step
- TASK5_6_HOOKS.md: add chicken-and-egg problem explanation and two-mechanism solution
- test: add test_begin_progress_batch_called_before_before_generation (31 tests total)
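Putting the two commits together, the per-step order in the grpo.py stub would look roughly like this. DO_TIME_SHARING and the hook names are from the commits; the phase functions, and which calls count as the five insertion points, are assumptions:

```python
DO_TIME_SHARING = True  # flag from the Task 5 commit


def generate(batch):      # placeholder rollout phase
    return batch

def train(rollouts):      # placeholder training phase
    pass

def sync_weights():       # placeholder weight-sync phase
    pass


def grpo_step(hooks, batch):
    if DO_TIME_SHARING:
        # Per the follow-up commit, begin_progress_batch runs before
        # before_generation each step (part of the step_target_estimate
        # bootstrap).
        hooks.begin_progress_batch(len(batch))
        hooks.before_generation()
    rollouts = generate(batch)
    if DO_TIME_SHARING:
        hooks.after_generation()
        hooks.before_training()
    train(rollouts)
    if DO_TIME_SHARING:
        hooks.after_training()
        hooks.before_weight_sync()
    sync_weights()
    if DO_TIME_SHARING:
        hooks.end_progress_batch()
```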

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
taoluo added a commit that referenced this pull request May 2, 2026
…Gate 4

Three rounds of audit against rlix/scheduler/scheduler.py + RLix planner +
existing MILES code surfaced 15 distinct P0 issues in the unified port
plan that would block M11.2 Gate 4 (dual-pipeline) implementation.

Bucket I — Init control flow vs RLix scheduler reality
  #1+#2  Reorder init: scheduler executes coordinator.resize_infer(add)
         RPC in Phase 5 before signaling pending GEN waiter (planner
         emits SchedGuidedAllocationOp even on first-pipeline alloc).
         All-shell RolloutManager + register_model_update_resources +
         publish_cache_ready_step(-1) + bootstrap_active_engines(empty)
         must run as Step 6.6a-f BEFORE Step 7's _request_cluster_gpus.
         Add Step 7.1-7.3 post-request validation. Empty MILES router
         started in all-shell ctor; activate_routing is the sole
         add_worker entry point; per-pipeline router port (no
         inheritance from peer). Coordinator gains
         _active_engines_bootstrapped: bool to distinguish an empty-set
         bootstrap from "not yet bootstrapped" (sketched below).
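A toy sketch of the tri-state this flag creates; only bootstrap_active_engines, resize_infer, and the _active_engines_bootstrapped name come from the commit, the rest is illustrative:

```python
class Coordinator:
    def __init__(self) -> None:
        self._active_engines: set[int] = set()
        # Distinguishes "bootstrapped with an empty set" (valid in the
        # all-shell init path) from "bootstrap never ran".
        self._active_engines_bootstrapped: bool = False

    def bootstrap_active_engines(self, engines: set[int]) -> None:
        self._active_engines = set(engines)
        self._active_engines_bootstrapped = True

    def resize_infer(self, add: set[int]) -> None:
        if not self._active_engines_bootstrapped:
            raise RuntimeError("resize_infer called before bootstrap")
        self._active_engines |= add
```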

Bucket II — F4/F5+6 weight-refresh internal contradictions
  #3  Drop per-bucket weight_version from sender invocation, wrapper
      payload, HTTP route metadata, and receiver API. Per-sync version
      publish is the atomic unit's step (c) — once via
      manager.set_weight_version(version, target_engines).
  #4  HTTP /update_weights_from_cpu_bucket route handler must enter
      tokenizer_manager.model_update_lock.writer_lock (mirrors existing
      update_weights_from_tensor path); active-refresh's "engine keeps
      serving" semantics need this pause boundary to prevent half-new
      weights being observed mid-load (sketched after this bucket).
  #5  Add RolloutManager.get_engine_handles(engine_indices) read-only
      API; relax service-manager rule to "two read-only entries +
      one write" so service can dispatch per-engine RPCs.
  #6  Replace 5 split sender RPCs with composite
      cache_owner_actor.run_sync_session(plan: SyncSessionPlan).
      _cache_lock held in single method body, not cross-RPC.
      SyncSessionPlan is fully self-contained (target_handles +
      cpu_serialize_local_ranks + broadcast_local_ranks + comm_ranks),
      cache_owner never calls back to service or manager (sketched below).
  #7  Replace dist.new_group with TCP rendezvous via existing MILES
      init_process_group helper (broadcast.py:137 pattern) on sender
      side and SGLang init_custom_process_group on receivers. Cross-
      process default-PG-less topology cannot use dist.new_group.
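Hedged sketches of the Bucket II mechanics follow. First, the Fix #4 pause boundary: the route name and the model_update_lock.writer_lock pattern (mirroring SGLang's update_weights_from_tensor handler) are quoted from the commit; the handler body and the load helper are hypothetical:

```python
async def update_weights_from_cpu_bucket(tokenizer_manager, bucket):
    # The writer lock pauses request admission instead of letting
    # requests observe half-new weights mid-load; "engine keeps
    # serving" resumes the moment the lock is released.
    async with tokenizer_manager.model_update_lock.writer_lock:
        await tokenizer_manager.load_cpu_bucket(bucket)  # hypothetical helper
```

Fix #6's composite RPC: the dataclass fields and method name are quoted from the commit; the field types and the session body are assumptions:

```python
import threading
from dataclasses import dataclass
from typing import Any


@dataclass
class SyncSessionPlan:
    target_handles: list[Any]            # engine handles receiving weights
    cpu_serialize_local_ranks: list[int]
    broadcast_local_ranks: list[int]
    comm_ranks: list[int]


class CacheOwnerActor:
    def __init__(self) -> None:
        self._cache_lock = threading.Lock()

    def run_sync_session(self, plan: SyncSessionPlan) -> None:
        # One RPC, one lock scope: _cache_lock is held for the whole
        # session body instead of stretching across five RPCs, and the
        # self-contained plan means no callbacks to service or manager.
        with self._cache_lock:
            for rank in plan.cpu_serialize_local_ranks:
                pass  # serialize shards for CPU transport (placeholder)
            for rank in plan.broadcast_local_ranks:
                pass  # stage NCCL broadcast (placeholder)
```

Fix #7's sender side, as a minimal stand-in using the public torch.distributed API and assuming the sender has no default process group yet; MILES' actual helper (the broadcast.py:137 pattern) builds the group over the same TCP rendezvous, paired with init_custom_process_group on the SGLang receivers:

```python
import datetime
import torch.distributed as dist


def init_sender_group(host: str, port: int, rank: int, world_size: int) -> None:
    # dist.new_group is off the table: it needs an existing default PG
    # spanning all members, which these separate processes don't share.
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{host}:{port}",
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(minutes=5),
    )
```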

Bucket III — Spec drift / API contradictions
  #8  F12 main body switched from get_rollout_workers() / worker_placements
      to get_all_rollout_engine_placements() (declared full table) +
      get_active_engine_indices(allocated_gpus, tp_size) (active subset
      with half-engine raise; sketched after this bucket).
  #10 Gate 3 invariant (c) restated as initial_completed=0 contract,
      replacing stale "buffer ready count non-zero" wording that
      contradicts Fix #1.
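A sketch of the Fix #8 active-subset helper; the name, the arguments, and the half-engine raise are from the commit, the body is assumed:

```python
def get_active_engine_indices(allocated_gpus: int, tp_size: int) -> list[int]:
    if allocated_gpus % tp_size != 0:
        # The "half-engine raise": a partial engine's worth of GPUs is
        # treated as a bug rather than silently rounded down.
        raise ValueError(
            f"allocated_gpus={allocated_gpus} is not a multiple of "
            f"tp_size={tp_size}"
        )
    return list(range(allocated_gpus // tp_size))
```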

Bucket IV — TrainRayActor + milestone/Gate scope
  #9  TrainRayActor.__init__ accepts an optional local_rank kwarg; the
      RLix path passes local_rank=0 (1 actor per GPU under fractional
      allocation + manual CVD; ray.get_gpu_ids() values absent from the
      manual CVD list make get_local_gpu_id() raise ValueError). The
      standalone path keeps the existing fallback (see the sketch after
      this bucket). New file-change row for miles/ray/train_actor.py.
  #11 M11.2 unified-plan scope = Gate 4 happy path only (shell partial
      allocation + RolloutManager lazy ctor + Gate 4 (c)/(d)/(e)
      acceptance). admission_epoch / orchestrator cleanup / graceful
      drain stay M11.2-tagged follow-up — not Gate 4 pass/fail criteria.
      Manual `ray stop` is the accepted recovery on crash. Implementation
      order: Week 3 drops Gate 4; Week 4 picks up the M11.2 happy path.
  #12 Strengthened Gate 4 acceptance: (c) router worker list contains
      exactly active engine subset; (d) v=-1 short-circuit verified via
      both run_sync_session and finalize_weight_update call counters
      staying at pre-expand values; (e) donor-shrink-before-recipient-
      add ordering with T1 < T2 < T3 strict inequality.
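A sketch of the Fix #9 constructor change; the kwarg, the RLix-side value, and the failure mode are from the commit, while the fallback helper body is a placeholder:

```python
import ray


class TrainRayActor:
    def __init__(self, local_rank: int | None = None) -> None:
        if local_rank is not None:
            # RLix path passes local_rank=0: one actor per GPU under
            # fractional allocation with a manually set CVD, where
            # ray.get_gpu_ids() returns IDs absent from that list and
            # get_local_gpu_id() would raise ValueError.
            self.local_rank = local_rank
        else:
            # Standalone path keeps the existing derivation.
            self.local_rank = self._get_local_gpu_id()

    def _get_local_gpu_id(self) -> int:
        # Placeholder for MILES' existing get_local_gpu_id() fallback.
        return int(ray.get_gpu_ids()[0])
```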

Bucket V — Tail-end consequences of new init ordering
  #13 v=-1 service path skips run_sync_session, skips finalize fan-out,
      no cache_owner GPU touch. Required because Step 6 releases
      actor_train allocation before Step 7's GEN request — broadcast
      H2D staging on cache_owner would race the just-released GPU.
      Service runs only manager.set_weight_version(-1, target_engines);
      see the sketch after this bucket.
  #14 F10 RLix-mode startup fail-fast on resume args:
      args.load in {None, args.hf_checkpoint} AND
      args.ref_load in {None, args.hf_checkpoint}. Fix #13's no-transport
      assumes train and rollout base weights are equivalent; resume /
      non-equivalent ref_load would diverge them. CPU-only base sync
      for resume is M11.2 follow-up; for now fail-fast.
  #15 v=-1 path also skips engine.finalize_weight_update — SGLang's
      onload-time process_weights_after_loading already ran during
      manager.expand_engines lazy-spawn; re-running on already-loaded
      weights is redundant.
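Fixes #13-#15 compose into a short service-side path. In this sketch, only set_weight_version, run_sync_session, finalize_weight_update, get_engine_handles, and the arg-check conditions come from the commit; build_plan and the function names are invented:

```python
def check_resume_args(args) -> None:
    # Fix #14 fail-fast: the v=-1 no-transport path assumes train and
    # rollout base weights are equivalent, so resume args must not
    # point anywhere else.
    if not (args.load in {None, args.hf_checkpoint}
            and args.ref_load in {None, args.hf_checkpoint}):
        raise SystemExit("RLix mode: resume/ref_load would diverge from "
                         "the rollout base weights; aborting at startup")


def refresh_weights(service, manager, version, target_engines) -> None:
    if version == -1:
        # Fixes #13/#15: no run_sync_session (no cache_owner GPU touch,
        # which would race the GPUs Step 6 just released) and no
        # finalize fan-out (process_weights_after_loading already ran
        # during manager.expand_engines lazy-spawn).
        manager.set_weight_version(-1, target_engines)
        return
    plan = service.build_plan(version, target_engines)  # hypothetical
    service.cache_owner_actor.run_sync_session(plan)
    for engine in manager.get_engine_handles(target_engines):
        engine.finalize_weight_update(version)
    manager.set_weight_version(version, target_engines)
```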

Cross-cutting cleanups
  - F4 _cache_lock invariant clarified as in-method (run_sync_session
    body), not cross-RPC.
  - F5+6 atomic-unit description rewritten to spell out the (a)/(b)/(c)
    steps and the v=-1 short-circuit.
  - Receiver API table tightened: weight_version dropped from
    update_weights_from_cpu_bucket signature; finalize_weight_update
    row notes v=-1 skip.
  - SyncSessionPlan dataclass spec includes M11.2 narrowing note —
    Gate 4 happy-path acceptance runs cpu_serialize-only topology;
    NCCL broadcast complexity stays as M11.1 load-bearing structure
    (Gate 2.5 unchanged), M11.2 doesn't add transport work.
  - F12 placement provider rewrite + standalone path preserved.
  - Implementation follow-up table gains "M11.2 follow-up: CPU-only
    base sync at v=-1 for resume / non-equivalent ref_load" entry.

No code changes; spec-only document. 858 insertions / 235 deletions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TianyeGGBond merged commit 62c9426 into rlops:nemo on May 5, 2026